Generate MODS from Scopus articles data

Author: Markus Skyttner
Affiliation: KTH Royal Institute of Technology

pkgs <- "
kthcorpus DT bslib htmltools dplyr downloadthis
" 

# attach each package named in a whitespace-separated string
import <- function(x)
 x |> trimws() |> strsplit("\\s+") |> unlist() |>
  lapply(function(x) library(x, character.only = TRUE)) |> 
  invisible()

pkgs |> import()

Scopus data retrieval

The Scopus APIs for publication search and extended abstract data can be used to retrieve metadata for Scopus publications.

Recent publications from KTH

Scopus data for KTH can be retrieved from the Scopus APIs. This assumes that the environment variables SCOPUS_API_KEY and SCOPUS_API_INSTTOKEN are available; they need to be set in the ~/.Renviron file. Requests count towards a rate limit quota, which can be checked with scopus_ratelimit_quota().

scopus <- scopus_search_pubs_kth()
scopus_ratelimit_quota()
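Before spending quota on a request, it can be useful to confirm that both credentials are actually visible to the R session. The following is a minimal base R sketch; the helper name missing_env() is hypothetical and not part of kthcorpus.

```r
# Names of the environment variables mentioned above
required <- c("SCOPUS_API_KEY", "SCOPUS_API_INSTTOKEN")

# Hypothetical helper: return the names of variables that are unset or empty
missing_env <- function(vars) {
  vals <- Sys.getenv(vars, unset = "")
  vars[!nzchar(vals)]
}

missing <- missing_env(required)
if (length(missing) > 0) {
  message("Missing in environment: ", paste(missing, collapse = ", "))
}
```

After editing ~/.Renviron, the R session needs to be restarted (or readRenviron("~/.Renviron") called) for the new values to be picked up.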

Because of the quota limit, and since a scheduled job already provides the latest data, a better approach is to request the data from object storage.

scopus <- scopus_from_minio()

Extended Abstract API data

Given the Scopus identifier of a publication, we can retrieve additional information including, for example, raw affiliation strings.

# use the first id
sid <- scopus$publications$`dc:identifier` |> head(1)
scopus_abstract_extended(sid)
$scopus_abstract
# A tibble: 1 × 19
  `dc:publisher` srctype prism…¹ prism…² sourc…³ cited…⁴ prism…⁵ subtype opena…⁶
  <chr>          <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 Elsevier B.V.  j       2023-0… Journal 25349   0       872     ar      1      
# … with 10 more variables: `prism:issn` <chr>, subtypeDescription <chr>,
#   `prism:publicationName` <chr>, openaccessFlag <chr>, `prism:doi` <chr>,
#   `dc:identifier` <chr>, lang <chr>, keywords <chr>, sid <chr>,
#   `dc:description` <chr>, and abbreviated variable names ¹​`prism:coverDate`,
#   ²​`prism:aggregationType`, ³​`source-id`, ⁴​`citedby-count`, ⁵​`prism:volume`,
#   ⁶​openaccess

$scopus_authorgroup
# A tibble: 8 × 26
  sid          id        i ce_gi…¹ prefe…² prefe…³ prefe…⁴ prefe…⁵ autho…⁶ seq  
  <chr>        <chr> <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
1 SCOPUS_ID:8… 1         1 Md. Ah… Md Aha… M.A.    Habib   Habib … S00489… 1    
2 SCOPUS_ID:8… 1         2 Prosun  Prosun  P.      Bhatta… Bhatta… S00489… 5    
3 SCOPUS_ID:8… 2         1 Md. Ah… Md Aha… M.A.    Habib   Habib … S00489… 1    
4 SCOPUS_ID:8… 2         2 Md. Ab… Md Abd… M.A.    Haque   Haque … S00489… 3    
5 SCOPUS_ID:8… 2         3 Md. Mi… Md Mir… M.M.A.  Raihan  Raihan… S00489… 4    
6 SCOPUS_ID:8… 3         1 Serena  Serena  S.      Coccio… Coccio… S00489… 2    
7 SCOPUS_ID:8… 4         1 Anna    Anna    A.      Tompse… Tompse… S00489… 6    
8 SCOPUS_ID:8… 5         1 Anna    Anna    A.      Tompse… Tompse… S00489… 6    
# … with 16 more variables: ce_initials <chr>, fa <chr>, type <chr>,
#   ce_surname <chr>, auid <chr>, ce_indexed_name <chr>, country <chr>,
#   afid <chr>, country3 <chr>, city <chr>, organization <chr>,
#   affiliation_id <chr>, affiliation_instance_id <chr>, ce_source_text <chr>,
#   dptid <chr>, raw_org <chr>, and abbreviated variable names ¹​ce_given_name,
#   ²​preferred_name_ce_given_name, ³​preferred_name_ce_initials,
#   ⁴​preferred_name_ce_surname, ⁵​preferred_name_ce_indexed_name, …

$scopus_correspondence
# A tibble: 1 × 5
  sid                   ce_given_name ce_initials ce_surname ce_indexed_name
  <chr>                 <chr>         <chr>       <chr>      <chr>          
1 SCOPUS_ID:85148543994 Md. Ahasan    M.A.        Habib      Habib M.A.     

ORCiDs versus KTH identifiers

To look up KTH identifiers for researchers (kthids) automatically from their ORCiDs, the known associations can be fetched once, up front.

Note

This step is optional: without it, the associations are looked up on an article-by-article basis. Fetching them up front can speed up the process.

ko <- kthid_orcid()
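The idea of an up-front lookup can be sketched in base R as follows. The structure returned by kthid_orcid() is not shown in this document, so the two-column table below, its column names, all values, and the helper lookup_kthid() are assumptions made purely for illustration.

```r
# Toy association table; column names and all values are placeholders,
# not the actual structure returned by kthid_orcid()
ko_toy <- data.frame(
  orcid = c("0000-0000-0000-0001", "0000-0000-0000-0002"),
  kthid = c("u1aaaaaa", "u1bbbbbb"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: vectorised lookup, NA where no association is known
lookup_kthid <- function(orcids, ko) {
  ko$kthid[match(orcids, ko$orcid)]
}

lookup_kthid(c("0000-0000-0000-0002", "0000-0000-0000-0009"), ko_toy)
```

With a table like this in memory, resolving many authors is a single match() call rather than one remote lookup per article.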

Generating MODS for articles

Different publication types require slightly different kinds of MODS file content.

To work with Scopus articles, filter on the publication subtype, like so:

# subtype == "cp" # conference paper
# subtype == "ar" # article
# subtype == "ch" # book chapter

articles <- scopus$publications |> filter(subtype == "ar")
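To see how many records each subtype filter would keep, the subtypes can be tabulated first. This sketch uses a small made-up data frame as a stand-in for scopus$publications and plain base R instead of dplyr, so it runs without any credentials or packages.

```r
# Toy stand-in for scopus$publications (values are made up for illustration)
pubs <- data.frame(
  `dc:identifier` = c("SCOPUS_ID:1", "SCOPUS_ID:2", "SCOPUS_ID:3", "SCOPUS_ID:4"),
  subtype = c("ar", "cp", "ar", "ch"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Count publications per subtype
subtype_counts <- table(pubs$subtype)
print(subtype_counts)

# Base R equivalent of filter(subtype == "ar")
articles_toy <- pubs[pubs$subtype == "ar", ]
nrow(articles_toy)
```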

To generate MODS for a specific article, we first need its Scopus identifier:

sids <- articles$`dc:identifier`
sid <- sids |> head(1)

# we provide previous scopus search results and kthid_orcid pairs 
# to avoid runtime lookups for this data
mods <- sid |> scopus_mods(scopus = scopus, ko = ko)

mods |> xml2::read_xml() |> as.character() |> cat()
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-2.xsd">
  <mods xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" version="3.7" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-7.xsd">
    <genre authority="diva" type="contentTypeCode">referee</genre>
    <genre authority="diva" type="publicationTypeCode">article</genre>
    <genre authority="svep" type="publicationType">art</genre>
    <genre authority="diva" type="publicationType" lang="eng">Article in journal</genre>
    <genre authority="kev" type="publicationType" lang="eng">article</genre>
    <name type="personal" authority="kth" href="NA">
      <namePart type="family">Habib</namePart>
      <namePart type="given">Md. Ahasan</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation><![CDATA[KTH-International Groundwater Arsenic Research Group, Department of Sustainable Development, Environmental Science and Engineering, KTH Royal Institute of Technology, Stockholm, Sweden; NGO Forum for Public Health, Dhaka, Bangladesh]]></affiliation>
    </name>
    <name type="personal" authority="kth" href="NA">
      <namePart type="family">Cocciolo</namePart>
      <namePart type="given">Serena</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation><![CDATA[World Bank, Washington D.C., USA]]></affiliation>
    </name>
    <name type="personal" authority="kth" href="NA">
      <namePart type="family">Haque</namePart>
      <namePart type="given">Md. Abdul</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation><![CDATA[NGO Forum for Public Health, Dhaka, Bangladesh]]></affiliation>
    </name>
    <name type="personal" authority="kth" href="NA">
      <namePart type="family">Raihan</namePart>
      <namePart type="given">Md. Mir Abu</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation><![CDATA[NGO Forum for Public Health, Dhaka, Bangladesh]]></affiliation>
    </name>
    <name type="personal" authority="kth" href="NA">
      <namePart type="family">Bhattacharya</namePart>
      <namePart type="given">Prosun</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation><![CDATA[KTH-International Groundwater Arsenic Research Group, Department of Sustainable Development, Environmental Science and Engineering, KTH Royal Institute of Technology, Stockholm, Sweden]]></affiliation>
    </name>
    <name type="personal" authority="kth" href="NA">
      <namePart type="family">Tompsett</namePart>
      <namePart type="given">Anna</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation><![CDATA[Institute for International Economic Studies, Stockholm University, Sweden; Beijer Institute for Ecological Economics, Royal Academy of Sciences, Sweden]]></affiliation>
    </name>
    <titleInfo lang="eng">
      <title>How to clean a tubewell: the effectiveness of three approaches in reducing coliform bacteria</title>
    </titleInfo>
    <originInfo>
      <publisher>Elsevier B.V.</publisher>
      <dateIssued>2023</dateIssued>
      <dateOther type="availableFrom">10 May 2023</dateOther>
    </originInfo>
    <physicalDescription>
      <form authority="marcform">print</form>
    </physicalDescription>
    <identifier type="doi">10.1016/j.scitotenv.2023.161932</identifier>
    <identifier type="scopus">2-s2.0-85148543994</identifier>
    <identifier type="eissn">18791026</identifier>
    <identifier type="issn">00489697</identifier>
    <identifier type="articleId">161932</identifier>
    <typeOfResource>text</typeOfResource>
    <location>
      <url>https://api.elsevier.com/content/abstract/scopus_id/85148543994</url>
    </location>
    <subject lang="eng">
      <topic>Cleaning/maintenance</topic>
      <topic>Coliform bacteria</topic>
      <topic>Deep tubewells</topic>
      <topic>Disinfection</topic>
      <topic>Drinking water</topic>
    </subject>
    <abstract lang="eng"><![CDATA[Access to safe drinking water in rural Bangladesh remains a perpetual challenge. Most households are exposed to either arsenic or faecal bacteria in their primary source of drinking water, usually a tubewell. Improving tubewell cleaning and maintenance practices might reduce exposure to faecal contamination at a potentially low cost, but whether current cleaning and maintenance practices are effective remains uncertain, as does the extent to which best practice approaches might improve water quality. We used a randomized experiment to evaluate how effectively three approaches to cleaning a tubewell improved water quality, measured by total coliforms and E. coli. The three approaches comprise the caretaker's usual standard of care and two best-practice approaches. One best-practice approach, disinfecting the well with a weak chlorine solution, consistently improved water quality. However, when caretakers cleaned the wells themselves, they followed few of the steps involved in the best-practice approaches, and water quality declined rather than improved, although the estimated declines are not consistently statistically significant. The results suggest that, while improvements to cleaning and maintenance practices might help reduce exposure to faecal contamination in drinking water in rural Bangladesh, achieving widespread adoption of more effective practices would require significant behavioural change.]]></abstract>
    <note>Imported from Scopus. VERIFY.</note>
    <relatedItem type="host">
      <titleInfo>
        <title>Science of the Total Environment</title>
      </titleInfo>
      <identifier type="issn">00489697</identifier>
      <part>
        <detail type="volume">
          <number>872</number>
        </detail>
        <detail type="issue">
          <number>NA</number>
        </detail>
        <extent>
          <start>NA</start>
          <end>NA</end>
        </extent>
      </part>
    </relatedItem>
    <!--    <note type="funder">@Funder@ [@project_number_from_funder@]</note> -->
  </mods>
</modsCollection>
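The output above contains literal NA placeholders (for example href="NA" on the name elements and NA for the issue number), which is in line with the "VERIFY" note in the record. A possible way to list such placeholders before import is sketched below with xml2, which is already used above for printing. The MODS fragment is a shortened, made-up example modelled on the output above; the XPath expressions are only one way to do this.

```r
library(xml2)

# Toy MODS fragment, made up for illustration and modelled on the output above
mods_toy <- '<mods xmlns="http://www.loc.gov/mods/v3">
  <name type="personal" authority="kth" href="NA">
    <namePart type="family">Habib</namePart>
  </name>
  <relatedItem type="host">
    <part>
      <detail type="volume"><number>872</number></detail>
      <detail type="issue"><number>NA</number></detail>
    </part>
  </relatedItem>
</mods>'

doc <- read_xml(mods_toy)

# MODS elements live in a default namespace, so XPath needs a prefix for it
ns <- c(m = "http://www.loc.gov/mods/v3")

# Family names of authors whose href is the literal string "NA"
unresolved <- xml_find_all(
  doc, "//m:name[@href='NA']/m:namePart[@type='family']", ns
)
xml_text(unresolved)

# Detail types (volume, issue) whose number is the literal string "NA"
na_details <- xml_find_all(doc, "//m:detail[m:number='NA']", ns)
xml_attr(na_details, "type")
```

The explicit prefix m is needed because unprefixed names in XPath 1.0 do not match elements in a default namespace.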

The scopus_mods_crawl() function is vectorised, which means it can iterate over several Scopus identifiers:

my_sids <- sids |> head(10)

my_mods <- my_sids |> scopus_mods_crawl(scopus = scopus, ko = ko)
Generating MODS parameters for 10 identifiers...
Generating MODS based on parameters...
Returning 10 MODS
names(my_mods)
 [1] "SCOPUS_ID:85148543994" "SCOPUS_ID:85148537060" "SCOPUS_ID:85148534117"
 [4] "SCOPUS_ID:85148499622" "SCOPUS_ID:85148520239" "SCOPUS_ID:85148532021"
 [7] "SCOPUS_ID:85148504588" "SCOPUS_ID:85148528636" "SCOPUS_ID:85148524172"
[10] "SCOPUS_ID:85148505140"
my_mods$`SCOPUS_ID:85148543994` |> cat()
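The names of the returned list carry the SCOPUS_ID: prefix, while the MODS record shown earlier stores the same number as 2-s2.0-85148543994 in its scopus identifier element. The helpers below are hypothetical and based only on the identifier patterns visible in this document; they are not kthcorpus functions.

```r
# Hypothetical helpers, based only on the identifier patterns visible above
sid_number <- function(sid) sub("^SCOPUS_ID:", "", sid)
sid_to_eid <- function(sid) paste0("2-s2.0-", sid_number(sid))

sid_number("SCOPUS_ID:85148543994")
sid_to_eid("SCOPUS_ID:85148543994")
```

Both helpers are vectorised, so they can be applied directly to names(my_mods), for instance to cross-check list names against the identifiers inside the generated records.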

A zip file with the results can be generated and offered for download in a Quarto document.

zf <- write_mods_zip(my_mods, path = "~/temp/modz")
Generating zip file at ~/temp/modz/mods.zip
download_file(
  path = zf,
  output_name = "Files from downloadthis",
  button_label = "Download files",
  has_icon = TRUE,
  icon = "fa fa-save",
  self_contained = TRUE
)